
Data mining in biosciences
M2 DMBS
2024-2025
Marie-Joe Karam Kassar

Open Reading Frame (ORF) detection in a partial human transcriptome assembly

1. Objective

In this project, you will extract Open Reading Frames (ORFs) from a partial human transcriptome assembly provided in the Homo_sapiens_cdna_assembled.fasta file. You will set up a Docker container to conduct the analysis in a controlled environment.

2. Prerequisites

Before proceeding, ensure you have:

3. Project steps

Extract ORFs in Python
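One possible starting point for this step is sketched below. It is a minimal sketch under stated assumptions, not the required solution: an ORF is taken to be an ATG start codon followed in frame by the first stop codon (TAA, TAG or TGA), only the three forward frames are scanned, and the names `read_fasta`, `find_orfs` and the `min_codons` threshold are illustrative choices that do not come from the assignment.

```python
from typing import Iterator

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def read_fasta(lines) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def find_orfs(seq: str, min_codons: int = 1) -> list[tuple[int, int, str]]:
    """Return (start, end, orf) tuples for forward-strand ORFs.

    start is 0-based; end is exclusive and includes the stop codon.
    min_codons counts codons from ATG up to, but not including, the stop.
    """
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return orfs


# Toy sequences, made up for illustration only.
for name, seq in read_fasta([">tx1 demo", "CCATGAAATAGGG", ">tx2", "ATGTGA"]):
    print(name, find_orfs(seq))
```

On the real input, `read_fasta` would be given the open file handle of Homo_sapiens_cdna_assembled.fasta. Scanning the reverse complement and choosing a sensible minimum length are left open here because they depend on how you define an ORF.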

Save Results

Validate ORFs with BLAST+

#! /bin/bash
wget https://ftp.ncbi.nlm.nih.gov/blast/db/v5/swissprot.tar.gz
tar -xvf swissprot.tar.gz
#! /bin/bash
# 9606 is the taxid for human
blast? -db /path/to/swissprot -query /path/to/multifastafile -taxids 9606 -outfmt 7 -out /path/to/output/file.tsv
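The report written with -outfmt 7 is tab-separated, with comment lines that start with `#`. The sketch below shows one way to read it in Python, assuming the 12 default columns (no custom column list passed to -outfmt). The column labels, the function names and the `max_evalue` threshold are illustrative assumptions, and the sample lines are made up.

```python
import csv
import io

# Labels for the 12 default columns of BLAST+ tabular output (assumed: no custom
# column list was passed to -outfmt). The names themselves are illustrative.
COLUMNS = ["query", "subject", "pct_identity", "length", "mismatches", "gap_opens",
           "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score"]


def parse_blast_tabular(handle):
    """Yield one dict per hit line, skipping blank lines and '#' comment lines."""
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    for fields in csv.reader(rows, delimiter="\t"):
        hit = dict(zip(COLUMNS, fields))
        hit["pct_identity"] = float(hit["pct_identity"])
        hit["evalue"] = float(hit["evalue"])
        hit["bit_score"] = float(hit["bit_score"])
        yield hit


def queries_with_hits(hits, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue."""
    return {h["query"] for h in hits if h["evalue"] <= max_evalue}


# Made-up report lines, for illustration only.
sample = io.StringIO(
    "# Query: orf_1\n"
    "# 1 hits found\n"
    "orf_1\tSUBJ_A\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t410\n"
    "orf_2\tSUBJ_B\t30.0\t50\t35\t0\t1\t50\t10\t59\t0.5\t25.0\n"
)
print(queries_with_hits(parse_blast_tabular(sample)))
```

With a real report, replace `sample` with `open("/path/to/output/file.tsv")`. The E-value cutoff is a judgment call and should be justified in your analysis.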

Plot ORF Size Distribution
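A size distribution can be summarized by binning ORF lengths and then drawing a histogram. The sketch below is one way to do this, under assumptions: lengths are measured in nucleotides, the bin width is arbitrary, and plotting relies on matplotlib, which is not installed by the Dockerfile in this handout and would have to be added (for example with pip). All function names are illustrative.

```python
from collections import Counter


def orf_lengths(orfs):
    """Return the length in nucleotides of each ORF sequence."""
    return [len(seq) for seq in orfs]


def bin_lengths(lengths, bin_width=100):
    """Count lengths per bin; keys are the lower edge of each bin."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))


def plot_length_histogram(lengths, out_path, bins=50):
    """Save a histogram of ORF lengths as an image (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, suitable inside a container
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(lengths, bins=bins)
    ax.set_xlabel("ORF length (nt)")
    ax.set_ylabel("Number of ORFs")
    ax.set_title("ORF size distribution")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)


# Made-up lengths, for illustration only.
print(bin_lengths([90, 150, 160, 420]))
```

Calling `plot_length_histogram(lengths, "orf_lengths.png")` writes the figure to disk, which is convenient when the container has no display.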

Estimating the Mean Length of Open Reading Frames (ORFs) Using a Geometric Distribution

The length of an ORF can be modeled with a geometric distribution: each codon is a trial, and encountering a stop codon is a "success". Assume that each of the 64 possible codons in the genetic code is equally likely, and let p be the probability of encountering a stop codon in a single trial. Using the geometric distribution, determine the expected number of codons (trials) until a stop codon is encountered. Recall that for a geometric distribution, the expected number of trials until the first success is E[X] = 1/p.
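The identity E[X] = 1/p can be checked empirically with a short simulation. The sketch below uses an arbitrary illustrative value of p rather than the stop-codon probability, so that the exercise itself is left to you. The function names, the number of runs and the seed are assumptions chosen for the example.

```python
import random


def trials_until_success(p, rng):
    """Count Bernoulli(p) trials up to and including the first success."""
    trials = 1
    while rng.random() >= p:  # rng.random() < p happens with probability p
        trials += 1
    return trials


def simulated_mean(p, n_runs=20_000, seed=0):
    """Average the number of trials until success over n_runs repetitions."""
    rng = random.Random(seed)
    return sum(trials_until_success(p, rng) for _ in range(n_runs)) / n_runs


p = 0.25  # arbitrary illustrative value, not the stop-codon probability
print(f"simulated mean: {simulated_mean(p):.2f}, theoretical 1/p: {1 / p:.2f}")
```

Replacing `p` with the stop-codon probability you derive lets you compare your analytical answer with a simulated mean, and later with the ORF lengths observed in the transcriptome.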

3. Project Setup

Your analysis must be run within a Docker container.

Build a Docker Image

Dockerfile:

FROM python:3.12-slim

RUN apt-get update && \
    apt-get install -y wget libgomp1 

WORKDIR /app

RUN wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz && \
    tar -xzvf ncbi-blast-2.16.0+-x64-linux.tar.gz && \
    mv ncbi-blast-2.16.0+ /usr/local/ && \
    rm ncbi-blast-2.16.0+-x64-linux.tar.gz

ENV PATH="/usr/local/ncbi-blast-2.16.0+/bin:$PATH"

RUN wget https://ftp.ncbi.nlm.nih.gov/blast/db/v5/swissprot.tar.gz && \
    mkdir -p /db/swissprot && mv swissprot* /db/swissprot && \
    tar -xzvf /db/swissprot/swissprot.tar.gz -C /db/swissprot

COPY . /app

blastp -db /db/swissprot/swissprot -query /analysis/test.fasta -taxids 9606 -outfmt 7 -out test.out

To build your Docker image and test it by running a container interactively, follow these steps:

docker build -t blast:v2.16.0 .

Then go to the working directory containing your script and input file, and run a container as follows:

docker run -it --rm -v /path/to/your/workingdir:/analysis blast:v2.16.0 bash

4. Submission

Your final submission should be on GitHub and include at least these files:

Version Control with Git